Language Variety


MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Whitehouse, Chenxi, Ruder, Sebastian, Lin, Tony, Kurylo, Oksana, Takagi, Haruka, Lam, Janice, Busetto, Nicolò, Diaz, Denise, Guzmán, Francisco

arXiv.org Artificial Intelligence

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

Dataset: https://huggingface.co/datasets/facebook/menlo

In order for LLMs to be most useful across the globe, they need to be able to provide high-quality responses in many languages. Responses should be relevant (Zhuang et al., 2024), factually accurate (Jacovi et al., 2025), and natural (Marchisio et al., 2024; Guo et al., 2025), among other considerations. Ultimately, for interaction in any language to be seamless, responses need to be indistinguishable from those of a native speaker (Novikova et al., 2016; Liu et al., 2021). Language proficiency in humans has traditionally been evaluated via standardized tests (Jamieson et al., 2000). While such tests have been applied to evaluating LLMs (Anil et al., 2023; Mayor-Rocher et al., 2024; Lothritz & Cabot, 2025), they are difficult to scale and do not readily correspond to real-world conversations.
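The pairwise judging setup described above can be sketched minimally as follows. The rubric text and the `judge` function are stand-ins of our own (a real judge would be an LLM call conditioned on the rubric; here a toy heuristic prefers the longer response), so this illustrates only the evaluation shape, not the paper's actual judge.

```python
# Minimal sketch of pairwise LLM-judge evaluation with a structured rubric.
# RUBRIC and judge() are illustrative stand-ins, not the paper's prompts.

RUBRIC = (
    "Compare responses A and B for native-like quality: "
    "fluency, naturalness, and appropriate register. Answer 'A' or 'B'."
)

def judge(prompt: str, a: str, b: str) -> str:
    # Placeholder for an LLM call; a real judge would condition on RUBRIC
    # and the prompt. This toy heuristic simply prefers the longer response.
    return "A" if len(a) >= len(b) else "B"

def pairwise_win_rate(pairs) -> float:
    """Fraction of pairs where the candidate response (A) is preferred.
    Each item is (prompt, response_a, response_b)."""
    wins = sum(judge(p, a, b) == "A" for p, a, b in pairs)
    return wins / len(pairs)

pairs = [
    ("Ask for the salt politely.", "Could you please pass the salt?", "Pass salt."),
    ("Greet a colleague.", "Hi!", "Good morning, how are you today?"),
]
print(pairwise_win_rate(pairs))  # 0.5 with this toy judge
```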


The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

Chen, William, Meng, Chutong, Shi, Jiatong, Bartelds, Martijn, Wang, Shih-Heng, Wang, Hsiu-Hsuan, Mosquera, Rafael, Hincapie, Sara, Jurafsky, Dan, Anastasopoulos, Antonis, Lee, Hung-yi, Livescu, Karen, Watanabe, Shinji

arXiv.org Artificial Intelligence

Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
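The character error rate (CER) used above is standardly computed as character-level Levenshtein distance divided by reference length; a minimal self-contained sketch (not the challenge's scoring code, which may also apply text normalization):

```python
# Character error rate (CER): Levenshtein distance over characters
# divided by the reference length.

def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ~ 0.0909
```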


Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish

Phan, Nhan, Kuronen, Mikko, Kautonen, Maria, Ullakonoja, Riikka, von Zansen, Anna, Getman, Yaroslav, Voskoboinik, Ekaterina, Grósz, Tamás, Kurimo, Mikko

arXiv.org Artificial Intelligence

Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).
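The post-inference calibration steps named above (temperature scaling of logits, then top-k renormalization of the probabilities) can be sketched as below. The temperature value and k are illustrative choices, not the paper's settings:

```python
# Hedged sketch: temperature-scaled softmax followed by top-k
# renormalization of the resulting probabilities.
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; temperature > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_normalize(probs, k=3):
    # Keep the k largest probabilities and renormalize them to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

logits = [2.0, 1.0, 0.5, -1.0]           # illustrative phone logits
probs = softmax(logits, temperature=2.0)  # temperature chosen for the demo
print(top_k_normalize(probs, k=3))
```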


Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Mamasaidov, Mukhammadsaid, Aral, Azizullah, Shopulatov, Abror, Inomjonov, Mironshoh

arXiv.org Artificial Intelligence

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
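The half-space character in Arabic script is the zero-width non-joiner (ZWNJ, U+200C), which marks morphological boundaries without a visible space. The rule below (re-inserting ZWNJ before a small suffix list) is a hypothetical simplification for illustration only, not the paper's post-processing method, and the suffix chosen is an invented example:

```python
# Illustrative sketch of half-space (ZWNJ, U+200C) restoration at
# morphological boundaries. The suffix list is hypothetical, not the
# paper's rules.

ZWNJ = "\u200c"
SUFFIXES = ("لر",)  # invented suffix used purely for demonstration

def restore_half_space(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) and ZWNJ not in word:
            # Re-insert the half-space between stem and suffix.
            return word[: -len(suf)] + ZWNJ + suf
    return word

print(restore_half_space("کتابلر"))  # ZWNJ inserted before the suffix
```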


TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa

Riley, Parker, Shakeri, Siamak, Ammar, Waleed, Clark, Jonathan H.

arXiv.org Artificial Intelligence

We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models' abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present the performance of two baseline models, and release our code and data to facilitate further improvement by the research community.


In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

Roll, Nathan, Graham, Calbert, Tatsumi, Yuka, Nguyen, Kim Tien, Sumner, Meghan, Jurafsky, Dan

arXiv.org Artificial Intelligence

Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided, though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
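The relationship between the two improvement figures quoted above (a 1.2 percentage-point absolute drop corresponding to roughly 19.7% relative) can be checked with simple arithmetic; the baseline WER below is chosen to be consistent with those numbers, not taken from the paper:

```python
# Relating absolute (percentage-point) and relative WER reductions.
# The baseline value is illustrative, back-derived from the quoted figures.

def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    return (baseline_wer - adapted_wer) / baseline_wer

baseline = 0.061            # illustrative baseline WER (~6.1%)
adapted = baseline - 0.012  # 1.2 pp absolute improvement
print(round(relative_reduction(baseline, adapted), 3))  # 0.197
```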


Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Tjuka, Annika, Forkel, Robert, Rzymski, Christoph, List, Johann-Mattis

arXiv.org Artificial Intelligence

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.


Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation

Hutchinson, Ben, Louro, Celeste Rodríguez, Collard, Glenys, Cooper, Ned

arXiv.org Artificial Intelligence

In Australia, post-contact language varieties, including creoles and local varieties of international languages, emerged as a result of forced contact between Indigenous communities and English speakers. These contact varieties are widely used, yet are poorly supported by language technologies. This gap presents barriers to participation in civil and economic society for Indigenous communities using these varieties, and reproduces minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns three questions regarding this context. First, can speech technologies support speakers of Australian Aboriginal English, a local indigenised variety of English? Second, what risks are inherent in such a project? Third, what technology development practices are appropriate for this context, and how can researchers integrate meaningful community participation in order to mitigate risks? We argue that opportunities do exist -- as well as risks -- and demonstrate this through a case study exploring design practices in a real-world project aiming to improve speech technologies for Australian Aboriginal English. We discuss how we integrated culturally appropriate and participatory processes throughout the project. We call for increased support for languages used by Indigenous communities, including contact varieties, which provide practical economic and socio-cultural benefits, provided that participatory and culturally safe practices are enacted.


Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists

Snee, David, Ciucci, Luca, Rubehn, Arne, van Dam, Kellen Parker, List, Johann-Mattis

arXiv.org Artificial Intelligence

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.
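The agreement statistic described above (the share of concepts whose independently elicited translations yield the same word form) reduces to a simple overlap computation; the wordlists below are toy examples, not data from the study:

```python
# Share of shared concepts whose translations agree across two
# independently compiled wordlists. Toy data, not the study's datasets.

def agreement_rate(wordlist_a: dict, wordlist_b: dict) -> float:
    shared = set(wordlist_a) & set(wordlist_b)
    same = sum(wordlist_a[c] == wordlist_b[c] for c in shared)
    return same / len(shared)

a = {"TREE": "baum", "WATER": "wasser", "STONE": "stein", "DOG": "hund"}
b = {"TREE": "baum", "WATER": "wasser", "STONE": "fels", "DOG": "hund"}
print(agreement_rate(a, b))  # 0.75
```

The same function applied to phonetic transcriptions instead of orthographic forms would give the stricter identical-transcription rate the study also reports.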


Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Sousa, Hugo, Almeida, Rúben, Silvano, Purificação, Cantante, Inês, Campos, Ricardo, Jorge, Alípio

arXiv.org Artificial Intelligence

Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research in this task.
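Framed as a task, LVI is binary text classification between varieties. The sketch below uses a trivial keyword cue as a stand-in for the transformer-based classifiers studied above; the cue words are real lexical differences (pt-BR "ônibus" vs. pt-PT "autocarro") but the classifier itself is purely illustrative:

```python
# Toy language-variety identification (LVI) as binary classification.
# A keyword heuristic stands in for the transformer classifiers; the cue
# sets are tiny illustrative samples of real lexical differences.

CUES_BR = {"você", "ônibus"}   # cues more typical of Brazilian Portuguese
CUES_EU = {"autocarro", "tu"}  # cues more typical of European Portuguese

def classify_variety(text: str) -> str:
    tokens = set(text.lower().split())
    br_hits = len(tokens & CUES_BR)
    eu_hits = len(tokens & CUES_EU)
    return "pt-BR" if br_hits >= eu_hits else "pt-PT"

print(classify_variety("o autocarro chega tarde"))  # pt-PT
```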